3. STATISTICAL NETWORK 
A. Part Introduction 

The statistical network, stat(1G), is a collection of routines that can be interconnected using the UNIX oper- 
ating system shell to form numerical processing networks. Included in stat are routines to generate simple sta- 
tistics and pictorial output. Since stat routines read and write unformatted text strings, they can be readily 
used with other UNIX operating system programs. 

It is useful to think of the shell as a tool for constructing processing networks in the sense of data flow pro- 
gramming. Routines are nodes of the network, and pipes and tees are links. Data flows from node to node in 
the network via links. 

B. Basic Concepts 

A collection of numerical data in the stat network is referred to as a vector. A vector is a sequence of num- 
bers separated by delimiters. Vectors are processed by routines referred to as nodes. There are four types of 
nodes: 

1. Transformers map input vectors to output vectors. 

2. Summarizers calculate statistics of a vector. 

3. Translators convert between formats. 

4. Generators are sources of definable vectors. 


Transformers 


A transformer is a node that reads an input vector, operates on each element of the vector, and outputs 
the resulting vector. For example, if the file able contains the vector: 

1 2 3 4 5 
then the command 
root able 
produces: 
1 1.41421 1.73205 2 2.23607 
which is the square root of each input element. Also, 
log able 
produces: 
0 0.693147 1.09861 1.38629 1.60944 
which is the natural logarithm of each element of vector able. 


Another transformer, af (arithmetic function), is particularly versatile. Its argument is an expression that 
is evaluated once for each element of an input vector. For example, 
af "2*able^2" 
will produce: 
2 8 18 32 50 


which is twice the square of each element from able. Expression arguments to af are usually surrounded by 
quotes since some of the operator symbols have special meaning to the shell. 
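
The element-wise behavior of transformers can be sketched outside the stat network. The following Python fragment is only an illustration, not the stat implementation: transform and show are hypothetical helpers, and the %g-style formatting is an assumption chosen because it gives six significant digits.

```python
import math

def transform(func, vector):
    """Apply func to every element, as a transformer node does."""
    return [func(x) for x in vector]

def show(vector):
    """Format a vector as one line of space-separated numbers (assumed layout)."""
    return " ".join(f"{v:g}" for v in vector)

able = [1, 2, 3, 4, 5]

print(show(transform(math.sqrt, able)))             # square root of each element
print(show(transform(math.log, able)))              # natural logarithm of each element
print(show(transform(lambda x: 2 * x ** 2, able)))  # twice the square of each element
```

Passing the operation as a function plays the role of the expression argument to af: it is evaluated once per element.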


Summarizers 
A summarizer is a node that calculates a statistic for a vector. Typically, summarizers read in all of the 
input elements, then calculate and output the statistic. For example, using the vector able from the previous 
examples, 


mean able 
produces: 

3 
and 

total able 
produces: 

15 
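
A summarizer differs from a transformer in that it must see the whole vector before it can produce its single result. A minimal Python sketch of that idea follows; the function names mirror the node names only for readability and are not the stat code.

```python
def mean(vector):
    """Read every element first, then output one statistic."""
    values = list(vector)
    return sum(values) / len(values)

def total(vector):
    """Sum of all elements of the vector."""
    return sum(vector)

able = [1, 2, 3, 4, 5]
print(f"{mean(able):g}")   # mean of able
print(f"{total(able):g}")  # total of able
```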


Translators 


A translator is a node that produces an output of a different structure than its input. Graphical translators 
accept input vectors and produce pictorials in GPS. Among the programs that understand GPS is ged, the 
graphical editor, which means that the graphical output of any translator can be directly edited at a display 
terminal. For example, the hist node is a translator that produces a GPS. This GPS describes a histogram de- 
rived from an input vector that consists of interval limits and counts. A wide range of x-y plots can be con- 
structed using the plot translator. 


Generators 
One way to create a vector is by using a generator. A generator is a node that accepts no input, and outputs 
a vector based upon definable parameters. The gas node is a generator that produces additive sequences. One 
of the parameters to gas is the number of elements in the generated vector. For example, the vector able 
that we have been using can be created with 
gas -n5 
which produces: 

1 2 3 4 5 


Another generator is rand, which generates a random sequence of numbers. 

Node Parameters 
Most nodes accept parameters to direct their operation. Parameters are specified as command-line options. 
The root transformer can compute more than square roots of the input vector. It can compute any root, as speci- 
fied by the r option. For example, 
root -r3 able 
will produce: 
1 1.25992 1.44225 1.5874 1.70998 


which is the cube root of each element from able. 


Each node in the stat network has options which can be used to control the operation of that node. These 
options are described later. 


Building Networks 
Nodes are interconnected using standard UNIX operating system shell concepts and syntax. Pipes are the 
linear connector attaching the output of one node to the input of another. As an example, the mean of the cube 
roots of vector able is found by 
root -r3 able | mean 
and produces: 


1.39991 


Often the required network is not so simple. Tees and sequence can be used to build nonlinear networks. 
To find both the mean and the median of the cube roots of able, the commands 

root -r3 able | tee able2 | mean; point able2 
will produce: 


1.399 
1.442 


There is a distinction between the sequence operator (;) and the linear connector, the pipe (|). Since processes 
in a pipeline run concurrently, each file name in the pipeline must be unique. Sequence implies run to completion 
(so long as & is not used); hence, names may be duplicated and often are. 
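
The data flow of such a network can be mimicked in Python to make the roles of the pipe, the tee, and the sequence explicit. This is a sketch under assumptions: the statistics module stands in for the mean and point nodes, and point is assumed to report the median, as in the example above.

```python
from statistics import mean, median

able = [1, 2, 3, 4, 5]

# Pipe: the output of the cube-root step feeds the next step directly.
cube_roots = [x ** (1.0 / 3.0) for x in able]

# Tee: keep a copy of the intermediate vector for later use.
able2 = list(cube_roots)

# Sequence: the second statistic is computed only after the first has finished.
print(f"{mean(cube_roots):g}")
print(f"{median(able2):g}")
```

Because the second step reads the saved copy, it never competes with the first for the same stream, which is the point of the tee.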

There is a special case of nonlinear networks where the result of one node is used as command-line input 
for another. Command substitution makes this easy. For example, to generate residuals from the mean of 
able, use 

af "able-`mean able`" 


which results in: 


-2 -1 0 1 2 

Vectors 
Vectors may be handled just as text files are. Hence, they can be created and modified by the UNIX operating 
system text editor. A useful property of vectors is that they consist of a sequence of numbers surrounded by 
delimiters (a delimiter is anything that is not a number, sign, or decimal point). The sign of a number (+ or 
-) is optional. Thus, the vector fruit can be created with the text editor and can contain 
1apple,-3peaches,6.2pears,12lemons, 
which when read would yield the vector: 

1 -3 6.2 12 


When a vector is not specified as an argument to a node, input is taken from the standard input (i.e., the 
terminal) or from a pipe. 
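
The delimiter rule can be illustrated with a regular expression that keeps only signs, digits, and decimal points. This is a sketch, not the parser used by the stat nodes: read_vector is a hypothetical name, and the treatment of stray signs or lone decimal points is an assumption.

```python
import re

# An optional sign followed by digits with an optional decimal point.
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")

def read_vector(text):
    """Extract the numbers from text; everything else acts as a delimiter."""
    return [float(token) for token in NUMBER.findall(text)]

print(read_vector("1apple,-3peaches,6.2pears,12lemons,"))
```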


C. Node Descriptions 

The stat nodes are divided into four types: 

1. transformer nodes 

2. summarizer nodes 

3. translator nodes 

4. generator nodes. 

All nodes accept the same command-line format: 
command [options] [filenames] 
where command is any stat node. The options argument is a - followed by one or more options. 

Each file argument to a node is taken as input to one occurrence of the node. That is, the node is executed 
from its initial state once per file. If no files are given, the standard input is used. All nodes, except generators, 
accept files as input; hence, it is not made explicit in the descriptions that follow. 

Most nodes accept command-line options to direct the execution of the node. Some options take values. In 
the following descriptions, to indicate the type of value associated with an option, the option key-letter is fol- 
lowed by: 

c       to indicate characters 
i       to indicate integer 
f       to indicate floating point or integer 
string  to indicate a character string 
file    to indicate a file name 

Thus the option ci implies that c expects an integer value i. 


Transformers 


Transformers copy an input into an output vector after performing various mathematical or logical opera- 
tions. All transformers have a ci option, where i specifies the number of columns per line in the output. The 
default condition is 5 columns per line. 

abs     [-ci] 
        Compute absolute value of elements of input vector. 

af      [-ci t v] 
        Arithmetic function, where t causes the output to be titled from the vector on the standard 
        input and v causes function expansion to be echoed. 

ceil    [-ci] 
        Round up each element of input vector to next integer. 

cusum   [-ci] 
        Generate cumulative sum of elements of the input vector. 

exp     [-ci] 
        Transform each element of input vector into its appropriate exponential value. 

floor   [-ci] 
        Round down each element of input vector to next integer. 

gamma   [-ci] 
        Compute gamma log for each element of input vector. 

list    [-ci dstring] 
        List vector elements which are separated by delimiters specified in string. 

log     [-ci bf] 
        Compute logarithm (base f) for each element of input vector. 

mod     [-ci mf] 
        Compute f modulus of each element of input vector. 

pair    [-ci Ffile xi] 
        Pair elements of input vector with elements of vector in file with a group size specified by 
        the x argument. 

power   [-ci pf] 
        Raise each element of input vector to the specified power. 

root    [-ci rf] 
        Compute specified root of each element of input vector. 

round   [-ci pi si] 
        Round each element of input vector to nearest integer. Consider up to the specified number 
        of places (p) after decimal point and specified number of significant digits (s). The number 
        .5 rounds to 1. 

siline  [-ci if ni sf] 
        Generate a line with specified intercept (i), slope (s), and number (n) of positive integers 
        using elements of input vector. 

sin     [-ci] 
        Compute sine of each element of input vector. 

spline  [-options] 
        Interpolate smooth curve. The input and output are sequences of X,Y coordinates (like that 
        produced by pair). 

subset  [-af bf ci Ffile ii lf nl np pf si ti] 
        Generate a subset from elements of input vector. The file contains a master vector. 

        option  specifies 
        a       above 
        b       below 
        i       interval 
        l       leave 
        nl      element numbers to leave 
        np      element numbers to pick 
        p       pick 
        s       start 
        t       terminate 
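
As an illustration of the column option shared by all transformers, the sketch below combines a cumulative sum with output broken into a fixed number of columns per line. It is written in Python under assumptions: format_columns is a hypothetical helper, and a single space is assumed as the output delimiter.

```python
from itertools import accumulate

def cusum(vector):
    """Cumulative sum of the elements of the input vector."""
    return list(accumulate(vector))

def format_columns(vector, columns=5):
    """Lay out a vector with the given number of columns per line (5 by default)."""
    lines = []
    for start in range(0, len(vector), columns):
        row = vector[start:start + columns]
        lines.append(" ".join(f"{v:g}" for v in row))
    return "\n".join(lines)

# Seven running totals printed three per line.
print(format_columns(cusum(range(1, 8)), columns=3))
```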


Summarizers 


A summarizer is a node that calculates a statistic for a vector. Typically, summarizers read in all input val- 
ues, calculate, and output the statistics. 


bucket  [-ai ci Ffile hf ii lf ni] 
        Break into buckets. 

        option  specifies 
        a       average size 
        F       bucket boundaries 
        h       upper limit 
        i       interval 
        l       lower limit 
        n       maximum number of elements 

cor     [-Ffile] 
        Compute correlation coefficient. The file contains base vector. 

hilo    [-h l o ox oy] 
        Find high and low values. 

        option  function 
        h       finds only high value 
        l       finds only low value 
        o       outputs option form (suitable for plot) 
        ox      prepends option form with x 
        oy      prepends option form with y 

lreg    [-Ffile i o s] 
        Compute linear regression. The file contains base vector. 

        option  causes 
        i       only intercept to be output 
        o       slope and intercept to be output in option form (suitable for siline) 
        s       only slope to be output 

mean    [-ff ni pf] 
        Compute (trimmed) mean as a fraction, number, or a percent. 

point   [-ff ni pf s] 
        Compute and output point from empirical cumulative density function expressed as a frac- 
        tion, number, or a percent. 

prod    Generate internal product. 

qsort   [-el] 
        Sort elements of input vector. The -e option specifies smallest element of input vector. 

rank    Rank elements of input vector. 

total   Compute total sum of elements of input vector. 

var     Compute variance. 
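
The trimmed mean and the point summarizer can be sketched as follows. The text above does not define the trimming rule or how point interpolates, so both are assumptions here: the sketch drops the same fraction of sorted elements from each end, and it interpolates linearly between neighboring order statistics.

```python
def trimmed_mean(vector, fraction=0.0):
    """Mean after dropping `fraction` of the sorted elements from each end (assumed rule)."""
    ordered = sorted(vector)
    drop = int(len(ordered) * fraction)
    kept = ordered[drop:len(ordered) - drop]
    if not kept:
        raise ValueError("fraction trims away every element")
    return sum(kept) / len(kept)

def point(vector, fraction=0.5):
    """Value at `fraction` of the empirical distribution (linear interpolation assumed)."""
    ordered = sorted(vector)
    position = fraction * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    weight = position - low
    return ordered[low] * (1 - weight) + ordered[high] * weight

data = [1, 2, 3, 4, 100]
print(trimmed_mean(data))        # plain mean, pulled up by the outlier
print(trimmed_mean(data, 0.2))   # one element dropped from each end
print(point(data))               # fraction 0.5 gives the median
```

Trimming makes the mean less sensitive to outliers such as the 100 in this hypothetical data.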
Translators 


Some translators accept vectors as input, while others accept a GPS. The title translator accepts either. 
bar     [-a b f g ri wi xf xa yf ya ylf yhf] 
        Build a bar chart. Elements of input vector define heights of bars. By default, the x-axis 
        is labeled with positive integers beginning at 1. 

        option  function 
        a       suppresses printing of axes 
        b       causes bar chart to be printed with bold weight lines 
        f       suppresses frame around plot area 
        g       suppresses background grid 
        ri      places bar chart in GPS region i, where i is between 1 and 25 inclusive 
                (default is 13) 
        wi      defines ratio of bar width to center-to-center spacing, where i represents 
                a percentage (default is 50) 
        xf      positions bar chart in GPS universe at specified x-origin 
        yf      positions bar chart in GPS universe at specified y-origin 
        xa      suppresses labeling of x-axis 
        ya      suppresses labeling of y-axis 
        ylf     defines y-axis low tick value 
        yhf     defines y-axis high tick value 

hist    [-a b f g ri xf xa yf ya ylf yhf] 
        Build a histogram. The input vector (of type produced by bucket) is of odd rank with odd 
        elements being limits and even elements being bucket counts. 

        option  function 
        a       suppresses printing of axes 
        b       causes bar chart to be printed with bold weight lines 
        f       suppresses frame around plot area 
        g       suppresses background grid 
        ri      places bar chart in GPS region i, where i is between 1 and 25 inclusive 
                (default is 13) 
        xf      positions bar chart in GPS universe at specified x-origin 
        yf      positions bar chart in GPS universe at specified y-origin 
        xa      suppresses labeling of x-axis 
        ya      suppresses labeling of y-axis 
        ylf     defines y-axis low tick value 
        yhf     defines y-axis high tick value 
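
The input format expected by hist, with limits in the odd positions and counts in the even positions, can be illustrated with a small Python sketch. It is not the bucket node itself: equal-width buckets are assumed, and a value lying on an interior limit is assumed to fall into the bucket above it.

```python
import math

def bucket(vector, low, high, interval):
    """Return limit, count, limit, count, ..., limit for equal-width buckets."""
    count = math.ceil((high - low) / interval)
    limits = [low + k * interval for k in range(count + 1)]
    counts = [0] * count
    for x in vector:
        if low <= x <= high:
            # The top limit itself is counted in the last bucket.
            index = min(int((x - low) / interval), count - 1)
            counts[index] += 1
    output = [limits[0]]
    for k in range(count):
        output.extend([counts[k], limits[k + 1]])
    return output

print(bucket([1, 2, 2, 3, 5], low=0, high=6, interval=2))
```

With n buckets there are n + 1 limits and n counts, so the resulting vector always has an odd number of elements.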


label   [-b c Ffile h p ri x xu y yr] 
        Label the axis of a GPS. The input is a GPS (like that produced by hist, bar, or plot). 

        option  function 
        b       used when input is a bar chart 
        c       used to retain lowercase letters in labels; otherwise, all letters are uppercase 
        Ffile   the label file; each line in this file is taken as a label, and blank lines 
                yield null labels 
        h       used when input is a histogram 
        p       causes input to be taken as an x-y plot 
        ri      rotates labels i degrees 
        x       causes x-axis to be labeled 
        xu      causes label for x-axis to be placed at top of plot 
        y       causes y-axis to be labeled 
        yr      causes label for y-axis to be placed on right of plot 

pie     [-b o p pni ppi ri v xi yi] 
        Build a pie chart. The input vector has a restricted format. Each line represents a slice of 
        pie and is of the form: 

        [<i e f ccolor>] value [label] 

        with brackets indicating optional fields. Control field options have the following effect: 

        option  function 
        i       the slice will not be drawn, although a space will be left for it. 
        e       the slice is "exploded" or moved away from the pie. 
        f       the slice is filled. The angle of fill lines depends on the color of the slice. 
        ccolor  the slice is drawn in color rather than the default black. Legal values for 
                color are b for black, r for red, g for green, and u for blue. 

        The pie is drawn with the value of each slice printed inside and the label printed outside. 

        option  function 
        b       draw pie chart in bold weight lines; otherwise, use medium weight. 
        o       output values around outside of pie. 
        p       output value as a percentage of total pie. 
        pni     output value as a percentage, but total of percentages equals i rather than 100. 
                pn100 is equivalent to p. 
        ppi     draw only i percent of a pie. 
        ri      put the pie chart in region i, where i is between 1 and 25 inclusive. Default is 13. 
        v       do not output values. 
        xi      position the pie chart in GPS universe with x-origin at i. 
        yi      position the pie chart in GPS universe with y-origin at i. 

plot    [-a b cstring d f Ffile g m ri xf xa xhf xif xlf xni xt yf ya yhf yif ylf yni yt] 
        Plot a graph. The input vector contains the y values of an x-y graph. Values for the x-axis 
        come from file. By default, the axes scales are determined by the first vector plotted. 

        option   function 
        a        suppress axes. 
        b        plot graph with bold weight lines; otherwise, use medium. 
        cstring  the characters of string are used, in order, to mark the points of each separately 
                 plotted graph: the first character marks the first graph, the second marks the 
                 second, and so on. If the number of characters in string is less than the number 
                 of plots, the last character is used for all remaining plots. The m option is 
                 implied. 
        d        do not connect plotted points; implies option m. 
        f        do not build a frame around plot area. 
        Ffile    use file for x-values; otherwise, the positive integers are used. This option 
                 may be used more than once, causing a different set of x-values to be paired 
                 with each input vector. If there are more input vectors than sets of x-values, 
                 the last set applies to the remaining vectors. 
        g        suppress the background grid. 
        m        mark the plotted points. 
        ri       put the graph in GPS region i, where i is between 1 and 25 inclusive. The 
                 default is 13. 
        xf       position the graph in the GPS universe with x-origin at f. 
        yf       position the graph in the GPS universe with y-origin at f. 
        xa       omit x-axis labels. 
        ya       omit y-axis labels. 
        xhf      f is the x-axis high tick value. 
        yhf      f is the y-axis high tick value. 
        xif      f is the x-axis tick increment. 
        yif      f is the y-axis tick increment. 
        xlf      f is the x-axis low tick value. 
        ylf      f is the y-axis low tick value. 
        xni      i is the approximate number of ticks on the x-axis. 
        yni      i is the approximate number of ticks on the y-axis. 
        xt       omit x-axis title. 
        yt       omit y-axis title. 

title   [-b c lstring vstring ustring] 
        Title a vector or a GPS. The input can be either a GPS or a vector. Title prefixes a title 
        to a vector or appends a title to a GPS. 

        option   function 
        b        make the GPS title bold. 
        c        retain lowercase letters in title; otherwise, all letters are uppercase. 
        lstring  for a GPS, generate a lower title string. 
        ustring  for a GPS, generate an upper title string. 
        vstring  for a vector, title string. 


Generators 


The stat generators are used to generate vectors. Thus, they do not accept input vectors. 


gas     [-ci if ni sf tf] 
        Generate additive sequence. Output vector consists of a maximum of ni elements (default 
        of 10) starting at sf (default of zero) and terminating at tf (default of infinity). Each 
        element is numerically separated from the next by the specified interval if (default of 1). 
        For default values, the gas -t default value (infinity) is never reached. 
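
The interplay of the start, interval, terminate, and count parameters can be sketched in Python. This is an illustration only: the start value is left as a required argument rather than given a default, a positive interval is assumed, and the small tolerance on the terminating value is an assumption to absorb floating-point error.

```python
import math

def gas(start, interval=1.0, terminate=math.inf, n=10):
    """Additive sequence: at most n elements, stopping once terminate is passed."""
    output = []
    k = 0
    while len(output) < n:
        value = start + k * interval
        if value > terminate + 1e-9:   # past the terminating value
            break
        output.append(value)
        k += 1
    return output

print(gas(1, n=5))                                   # five elements, interval of 1
print([round(v, 2) for v in gas(0.05, 0.03, 0.15)])  # stops before exceeding 0.15
```

Whichever limit is reached first, the element count or the terminating value, ends the sequence.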


prime   [-ci hi li ni] 
        Generate prime numbers. Output vector consists of a maximum of ni consecutive prime 
        numbers (default of 10) where the first element is greater than or equal to li (default of 
        2) and the last element is less than or equal to hi (default of infinity). 


rand    [-ci hf lf mf ni si] 
        Generate random sequence. Output vector consists of ni elements (default of 10) which are 
        random numbers generated by a multiplicative congruential generator with si as a seed 
        (default of 1). All numbers will be greater than or equal to lf (default of 0) and less than 
        or equal to hf (default of 1). 
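
A multiplicative congruential generator computes each state as the previous state times a fixed multiplier, modulo a fixed modulus. The Python sketch below shows the technique; the constants 16807 and 2^31 - 1 are the well-known Park-Miller choices and are an assumption, since the constants used by rand are not stated here.

```python
def rand(n=10, seed=1, low=0.0, high=1.0, multiplier=16807, modulus=2**31 - 1):
    """n pseudo-random numbers in [low, high] from a multiplicative congruential generator."""
    state = seed
    output = []
    for _ in range(n):
        state = (multiplier * state) % modulus            # next state
        output.append(low + (high - low) * state / modulus)  # scale into the range
    return output

values = rand()
print(len(values), min(values) >= 0.0, max(values) <= 1.0)
```

The same seed always yields the same sequence, which makes runs repeatable.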


D. Examples 
Example 1: 


To calculate the total value of an investment held for a number of years at an interest rate compounded 
annually: 


Principal=1000 
echo "Total return on $Principal units compounded annually" 
echo "rates:\t\t\c"; gas -s.05,t.15,i.03 | tee rate 
for Years in 1 3 5 8 
do 
    echo "$Years year(s):\t\c"; af "$Principal*(1+rate)^$Years" 
done 

Total return on 1000 units compounded annually 

rates:          0.05     0.08     0.11     0.14 
1 year(s):      1050     1080     1110     1140 
3 year(s):      1157.62  1259.71  1367.63  1481.54 
5 year(s):      1276.28  1469.83  1685.06  1925.41 
8 year(s):      1477.46  1850.93  2304.54  2852.59 


There is a distinction between vectors and constants as operands in the expression to af. Shell variables 
$Principal and $Years are constants to af, while file rate is a vector. The af command executes once per element 
in rate. 
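
The same calculation can be checked with a few lines of Python, which make the distinction explicit: principal and years are scalars, while rates is the vector the expression is evaluated over. The function name total_return is hypothetical.

```python
principal = 1000
rates = [0.05, 0.08, 0.11, 0.14]

def total_return(principal, rates, years):
    """Evaluate principal*(1+rate)^years once per element of rates."""
    return [principal * (1 + rate) ** years for rate in rates]

for years in (1, 3, 5, 8):
    row = "  ".join(f"{value:.2f}" for value in total_return(principal, rates, years))
    print(f"{years} year(s):\t{row}")
```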

Example 2: 


To generate a bar chart of the percent of execution time consumed by each routine in a program: 


prof | cut -c1-15 | sed -e 1d -e "/0.0/d" -e "s/**//" >P 
echo "These are the execution percentages:"; cat P 
title P -v "execution time in percent" | bar -xa -yl0,yh100 | 
label -br-45,FP | td 

These are the execution percentages: 

_fork 32.9 
_creat 14.3 
_sbrk 14.3 
_read 14.3 
_open 14.3 
_prime 9.9 

Figure 3.1 shows the output of these commands. 

[Bar chart titled EXECUTION TIME IN PERCENT] 

Fig. 3.1 - Routine Execution Time Consumed Bar Chart 


The prof command is a UNIX operating system command that generates a listing of execution times. The 
cut and sed commands are used to eliminate extraneous text from the output of prof (it is because verbiage 
can get in the way that stat nodes say very little). P is a vector to title, while it is a text file to cat and label. 

Example 3: 
To plot the relationship between execution time of a program and number of processes in the process table: 


# The first program generates performance data 
for i in `gas -n12` 
do 
    ps -ae | wc -l >>Procs & 
    time prime -n1000 >/dev/null 2>>Times 
    sleep 300 
done 

# The second program analyzes and plots data 
for i in real user sys 
do 
    grep $i Times | sed "s/$i//" | 
    awk -F: "{if(NF==2) print \$1*60+\$2; else print}" | 
    title -v "$i time in seconds" >$i 
    siline -`lreg -o,FProcs $i` Procs >$i.fit 
done 

title -v "number of processes" Procs | yoo Procs 

plot -dg,FProcs real -r12 >R12 
plot -ag,FProcs real.fit -r12 >>R12 
plot -dg,FProcs sys -r13 >R13 
plot -ag,FProcs sys.fit -r13 >>R13 
plot -dg,FProcs user -r8 >R8 
plot -ag,FProcs user.fit -r8 >>R8 
ged R12 R13 R8 


Performance data is the execution time, as reported by the time command, to generate the first 1000 prime 
numbers. The time command outputs three times for each run: 


• time in system routines 
• time in user routines 
• total real time. 
Each of these types of time is treated separately by the analysis program. 
Figure 3.2 shows the output of these commands. The short awk program converts "minutes:seconds" for- 
mat to "seconds". The lreg command does a linear regression of time vectors on size of the process table. 
The siline command generates a line based on parameters from the regression. One plot is generated for each 
type of time. Each plot is put into a different region so that it can be displayed and manipulated simultaneously 
in the graphical editor. 
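
The analysis steps described above (time conversion, regression, and generation of the fitted line) can be sketched in Python. The function names and the sample data are hypothetical, and ordinary least squares is assumed for the regression.

```python
def to_seconds(field):
    """Convert 'minutes:seconds' to seconds; plain seconds pass through unchanged."""
    parts = field.strip().split(":")
    if len(parts) == 2:
        return float(parts[0]) * 60 + float(parts[1])
    return float(parts[0])

def linear_regression(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fitted_line(intercept, slope, x):
    """Points on the regression line at the given x-values."""
    return [intercept + slope * xi for xi in x]

# Hypothetical measurements: process counts and reported times.
procs = [20, 25, 30, 35]
times = [to_seconds(t) for t in ["12.0", "14.5", "0:17.0", "0:19.5"]]

intercept, slope = linear_regression(procs, times)
print(intercept, slope)
print(fitted_line(intercept, slope, procs))
```

Plotting the measured times unconnected and the fitted line over the same x-values gives the kind of picture described for each region.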

Fig. 3.2 - Relationship Between Execution Time and Number of Processes 